Transformer fault diagnosis method based on TLR-ADASYN balanced dataset

As the cornerstone of transmission and distribution equipment, the power transformer plays a very important role in ensuring the safe operation of the power system. At present, dissolved gas analysis (DGA) is widely used in fault diagnosis of oil-immersed transformers. In practice, however, the limited number of transformer fault samples and the uneven distribution of fault types often lead to low overall diagnostic accuracy or to misjudgment of minority fault types. Therefore, a transformer fault diagnosis method based on a TLR-ADASYN balanced dataset is presented. The method addresses the sample imbalance and reduces the misjudgment caused by minority classes having few samples. It explores the correlation between the ratios of dissolved gas contents in oil and the fault type, eliminating redundant information and reducing the feature dimension. The diagnostic model SO-RF (Snake Optimization-Random Forest) is established, achieving a diagnostic accuracy of 97.06%, which enables online diagnosis of transformers. Comparative analyses using different sampling methods, features, and diagnostic models were conducted to validate the effectiveness of the proposed method. Finally, validation on a public dataset shows that the proposed method has strong generalization capability.

intelligence diagnostics with imbalanced small samples. Currently, experts and scholars have conducted extensive research to address the imbalance in datasets, proposing solutions from both the sample and algorithm perspectives. Sample-based solutions include oversampling and undersampling methods. Undersampling achieves sample balance by removing some majority class samples, but it is prone to eliminating valuable information and is not widely adopted [21]. Oversampling, on the other hand, balances the dataset by generating minority class samples [22-24]. Algorithm-based solutions primarily include ensemble learning [25] and cost-sensitive methods [26]. In one study, the ADASYN algorithm was used to augment minority class samples, further enhancing equipment fault classification performance [27]. Another study proposed enhancing intra-class feature aggregation by increasing the number of clusters based on the imbalance degree and K-means clustering, which improved sample identifiability [28]. Although these methods have reduced the misclassification and omission of minority class samples to some extent, they do not consider boundary samples and noise when synthesizing new samples, which results in fuzzy classification boundaries.
To address these issues, this paper tackles the problem of recognizing and classifying imbalanced small sample data at both the sample and algorithm levels, proposing a transformer fault diagnosis method based on a TLR-ADASYN balanced dataset. Firstly, the influence of noise and boundary samples is eliminated before balancing the data. Secondly, to address the limitations of traditional diagnostic methods in characterizing complex internal fault features of transformers, multi-dimensional ratio features are constructed. These features explore the correlation between the ratios of dissolved gas contents in the oil and the state types, eliminating the impact of redundant information and improving operational efficiency. Finally, a transformer fault diagnosis model is established, and the effectiveness of the proposed method is validated with real-world data.

Synthetic oversampling of boundary samples based on Tomek link

ADASYN minority-class sample synthesis technique
ADASYN is an adaptive data synthesis method proposed by He et al. [29]. The method adaptively synthesizes different numbers of new samples according to the distribution of minority samples. The specific algorithm steps are as follows.
Suppose the training set $D$ contains $m$ samples $(x_i, y_i)$, $i = 1, 2, \ldots, m$, where $x_i$ is a sample in the feature space $X$ and $y_i \in Y = \{-1, 1\}$ is its label. Let $m_s$ and $m_l$ denote the numbers of minority and majority samples, respectively, so that $m_s \le m_l$ and $m_s + m_l = m$.
Calculate the total number of minority-class samples that need to be synthesized, $G$:
$$G = (m_l - m_s) \times \beta$$
where $\beta \in [0, 1]$ is a parameter that specifies the desired balance level after the new data are generated. $\beta = 1$ indicates that the positive/negative ratio after sampling is 1:1.
For each minority sample $x_i$, calculate the proportion of majority-class samples among its K nearest neighbors:
$$r_i = \Delta_i / K, \quad i = 1, 2, \ldots, m_s$$
where $\Delta_i$ is the number of majority-class samples among the K nearest neighbors of $x_i$. Normalizing gives the sample weight $\hat{r}_i = r_i / \sum_{i=1}^{m_s} r_i$.
According to the sample weight, calculate the number of new samples that need to be generated for each minority sample:
$$g_i = \hat{r}_i \times G$$
Generate $g_i$ new samples for each minority sample:
$$s_i = x_i + (x_{iz} - x_i) \times \lambda$$
where $s_i$ is the synthesized new sample, $x_i$ is the i-th minority sample, $x_{iz}$ is a minority sample selected at random from the K nearest neighbors of $x_i$, $(x_{iz} - x_i)$ is the difference vector between the two minority samples in the feature space, and $\lambda$ is a random number in the interval [0, 1].
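The ADASYN steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the implementation used in the paper: the function name `adasyn`, its parameters, and the rounding of $g_i$ to the nearest integer are assumptions, and the interpolation partner is simply drawn from the K nearest minority samples.

```python
import math
import random

def adasyn(minority, majority, beta=1.0, k=5, seed=0):
    """Return synthetic minority samples (illustrative ADASYN sketch)."""
    rng = random.Random(seed)
    labelled = [(x, 1) for x in minority] + [(x, -1) for x in majority]
    # Total number of samples to synthesize: G = (m_l - m_s) * beta
    G = (len(majority) - len(minority)) * beta

    # r_i: share of majority samples among the K nearest neighbours of x_i
    ratios = []
    for x in minority:
        neighbours = sorted(
            (p for p in labelled if p[0] is not x),
            key=lambda p: math.dist(x, p[0]),
        )[:k]
        ratios.append(sum(1 for _, label in neighbours if label == -1) / k)

    total = sum(ratios)
    if total == 0:  # no minority sample borders the majority class
        return []

    synthetic = []
    for x, r in zip(minority, ratios):
        g = int(round(r / total * G))  # g_i = r_hat_i * G (rounding is an assumption)
        # Assumption: partners are the K nearest minority samples of x_i
        partners = sorted(
            (z for z in minority if z is not x),
            key=lambda z: math.dist(x, z),
        )[:k]
        if not partners:
            continue
        for _ in range(g):
            z = rng.choice(partners)
            lam = rng.random()  # lambda in [0, 1)
            synthetic.append(tuple(a + (b - a) * lam for a, b in zip(x, z)))
    return synthetic
```

Because each $g_i$ is rounded separately, the number of generated samples can differ slightly from $G$; minority samples surrounded by more majority neighbours receive more synthetic points.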

TLR-ADASYN equilibrium dataset
Tomek [30] proposed a modification of the condensed nearest neighbor (CNN) rule in 1976, which undersamples the boundary samples without destroying the underlying information. Two adjacent samples of different classes can be connected into a Tomek link. It is formed as follows: suppose there are two sample sets $C_1$ and $C_2$, with samples $u_i$ ($i \in \{1, \ldots, n\}$) and $v_j$ ($j \in \{1, \ldots, m\}$), respectively. Define the distance $dist(u_i, v_j) = \|u_i - v_j\|$. If there is no other sample $v_p$ or $u_q$ that satisfies $dist(u_i, v_p) < dist(u_i, v_j)$ or $dist(u_q, v_j) < dist(u_i, v_j)$, then $(u_i, v_j)$ form a Tomek link.
For each $u_i \in C_1$, find the nearest $v_p \in C_2$, form the link set $l_{12}$ and save it.
For each $v_j \in C_2$, find the nearest $u_q \in C_1$, form the link set $l_{21}$ and save it.
The pairs that appear in both $l_{12}$ and $l_{21}$ are the Tomek links.
Tomek Link reduces noise and boundary data by eliminating problematic pairs. ADASYN then expands the minority class data so that the classifier does not favor the majority class too strongly, addressing the bias issue.
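The link search described above can be sketched as follows. The function name `tomek_links` is hypothetical, and the sketch follows the definition as stated in this section (mutual nearest neighbours across the two classes); the usual definition of a Tomek link compares each pair against all samples, not only those of the opposite class.

```python
import math

def tomek_links(c1, c2):
    """Return index pairs (i, j) where c1[i] and c2[j] are mutual nearest
    neighbours across the two classes (illustrative sketch)."""
    # l12: for each u_i in C1, index of the nearest v in C2
    l12 = {
        i: min(range(len(c2)), key=lambda j: math.dist(u, c2[j]))
        for i, u in enumerate(c1)
    }
    # l21: for each v_j in C2, index of the nearest u in C1
    l21 = {
        j: min(range(len(c1)), key=lambda i: math.dist(v, c1[i]))
        for j, v in enumerate(c2)
    }
    # A pair is a link only if it appears in both sets
    return [(i, j) for i, j in l12.items() if l21[j] == i]
```

Which member of each pair is removed (the majority sample only, or both) is a design choice that this section does not specify, so the sketch stops at detection.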

Random forest
RF [31] is an ensemble algorithm. It is a set $\{h(X, \theta_k), k = 1, 2, \ldots, n\}$ composed of n decision tree classification models; the training subset of each tree is drawn by the Bootstrap sampling method, and the final classification result is obtained by subtree voting. The steps to build an RF classification model are as follows.
Step 1 Using Bootstrap sampling, draw samples with replacement from the training set N to generate a training subset of the same size.
Step 2 Assuming the training subset has S features, s features (s ≤ S) selected at random are taken as the split feature subset, and the node is split by the CART algorithm.
Step 3 Repeat Step 1 and Step 2 n times to generate the subtrees and build the RF model.
Step 4 Use the test set to verify the reliability of the RF model; the final classification results are decided by voting.
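The sampling and voting parts of these steps can be illustrated with a short standard-library sketch. The helper names are hypothetical and the CART fitting itself is omitted; in practice a library implementation (for example scikit-learn's `RandomForestClassifier`, which exposes `n_estimators` and `max_depth`) would normally be used.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Step 1: draw len(data) items with replacement."""
    return [rng.choice(data) for _ in data]

def random_feature_subset(n_features, s, rng):
    """Step 2: pick s of the S feature indices at random, without repeats."""
    return sorted(rng.sample(range(n_features), s))

def majority_vote(predictions):
    """Step 4: return the label predicted by the most subtrees."""
    return Counter(predictions).most_common(1)[0][0]
```

For example, `majority_vote([1, 2, 2, 3])` returns `2`, the label chosen by two of the four subtrees.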

Snake optimization algorithm
The Snake Optimization (SO) algorithm [32] is a meta-heuristic algorithm proposed in 2022, which mainly simulates the foraging and reproduction behavior of snakes. The algorithm has the advantages of a simple principle and good optimization performance. The specific principle is as follows.

Initialize
Snake population initialization is shown in Eq. (9):
$$X_i = X_{min} + r \times (X_{max} - X_{min})$$
where $X_i$ is the position of the i-th snake; $r$ is a random number in the range [0, 1]; $X_{max}$ and $X_{min}$ are the upper and lower boundaries.
The population is divided into two groups, male and female, each accounting for 50% of the individuals, and the best individual in each group is found. Define the temperature Temp and the amount of food Q, which can be expressed by Eqs. (10) and (11):
$$Temp = \exp\left(\frac{-t}{T}\right)$$
$$Q = c_1 \times \exp\left(\frac{t - T}{T}\right)$$
where $t$ represents the current number of iterations; $T$ is the maximum number of iterations; $c_1$ is a constant, usually 0.5.
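A minimal sketch of the initialization, the male/female split, and the two schedules is given below. It assumes the standard SO definitions $Temp = \exp(-t/T)$ and $Q = c_1 \exp((t - T)/T)$ from Ref. [32]; the function names are hypothetical, and the bounds in the usage note are the search ranges quoted later for the RF parameters.

```python
import math
import random

def init_population(n, x_min, x_max, rng):
    """X_i = X_min + r * (X_max - X_min), with r drawn per dimension."""
    return [
        [lo + rng.random() * (hi - lo) for lo, hi in zip(x_min, x_max)]
        for _ in range(n)
    ]

def split_population(population):
    """First half males, second half females (50% each)."""
    half = len(population) // 2
    return population[:half], population[half:]

def temperature(t, T):
    """Temp = exp(-t / T): starts at 1 and cools as iterations proceed."""
    return math.exp(-t / T)

def food_quantity(t, T, c1=0.5):
    """Q = c1 * exp((t - T) / T): grows towards c1 at the last iteration."""
    return c1 * math.exp((t - T) / T)
```

With `init_population(30, [0.0, 0.0], [100.0, 20.0], random.Random(0))`, each snake is a candidate (number of trees, tree depth) pair. Under these definitions and $c_1 = 0.5$, Q stays below 0.25 while $t < T(1 - \ln 2) \approx 0.31T$, so roughly the first 31% of the iterations are spent in the exploration phase.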

Exploration phase
If Q < Threshold (0.25), the snake randomly selects a location to search for food and updates its position. The exploration phase is shown in Eq. (12):
$$X_{i,m}(t+1) = X_{rand,m}(t) \pm c_2 \times A_m \times ((X_{max} - X_{min}) \times rand + X_{min})$$
where $X_{i,m}$ is the male position; $X_{rand,m}$ is the position of a randomly selected male; rand is a random number in [0, 1]; $c_2$ is a constant, usually 0.05; $A_m$ is the ability of the male to find food.

Development phase
When Q > Threshold (0.25) is satisfied, if Temp > Threshold (0.6), the snakes are in a hot state and only look for food. The position is updated as shown in Eq. (13):
$$X_{i,j}(t+1) = X_{food} \pm c_3 \times Temp \times rand \times (X_{food} - X_{i,j}(t))$$
where $X_{i,j}$ is the position of an individual snake (male or female); $X_{food}$ is the position of the best individual; rand is a random number in [0, 1]; $c_3$ is a constant, usually 2.
If Temp < Threshold (0.6), the temperature is cold and the snake is in fight mode or mating mode.
① Fight mode
$$X_{i,m}(t+1) = X_{i,m}(t) + c_3 \times FM \times rand \times (Q \times X_{best,f} - X_{i,m}(t))$$
where $X_{i,m}$ is the position of the i-th male; $X_{best,f}$ is the best position in the female group; rand is a random number in [0, 1]; FM is the fighting ability of the male.
② Mating mode
$$X_{i,m}(t+1) = X_{i,m}(t) + c_3 \times M_m \times rand \times (Q \times X_{i,f}(t) - X_{i,m}(t))$$
$$X_{i,f}(t+1) = X_{i,f}(t) + c_3 \times M_f \times rand \times (Q \times X_{i,m}(t) - X_{i,f}(t))$$
where $X_{i,m}$ is the position of the i-th male; $X_{i,f}$ is the position of the i-th female; rand is a random number in [0, 1]. $M_m$ and $M_f$ represent the mating ability of males and females, respectively. The specific implementation flow of the SO algorithm is shown in Fig. 1.
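The phase logic can be summarized in a short sketch. It assumes, as in the original SO formulation [32], that the food threshold (0.25) is tested against Q and the hot/cold threshold (0.6) is tested against Temp. The function names are hypothetical, and `hot_state_update` assumes the standard hot-state rule $X_{food} \pm c_3 \times Temp \times rand \times (X_{food} - X)$ with a randomly chosen sign.

```python
import random

def select_mode(q, temp, q_threshold=0.25, temp_threshold=0.6):
    """Return the behaviour for the current food quantity and temperature."""
    if q < q_threshold:
        return "exploration"      # little food: search at random
    if temp > temp_threshold:
        return "exploitation"     # hot: move towards the food position
    return "fight_or_mating"      # cold: fight mode or mating mode

def hot_state_update(x, x_food, temp, rng, c3=2.0):
    """Assumed hot-state rule: X_food +/- c3 * Temp * rand * (X_food - X)."""
    sign = rng.choice((-1.0, 1.0))
    return [
        f + sign * c3 * temp * rng.random() * (f - xi)
        for xi, f in zip(x, x_food)
    ]
```

A snake already sitting at the food position stays there, since the difference term is zero; how the cold state chooses between fighting and mating is left open here because this section does not specify it.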

Kernel principal component analysis
KPCA [33] maps the fault sample data into a high-dimensional space using a kernel function and then extracts the essential low-dimensional data features within a linear subspace. This approach both preserves as much critical fault information as possible and removes correlations among fault features. The specific steps are as follows.
Map the fault dataset $\{e_1, e_2, \ldots, e_n\}$ to a high-dimensional space through $\Phi$, forming a new dataset $\{\Phi(e_1), \Phi(e_2), \ldots, \Phi(e_n)\}$. Assuming the samples in the high-dimensional space are already centered, the covariance matrix is as shown in Eq. (17):
$$C = \frac{1}{n} \sum_{i=1}^{n} \Phi(e_i) \Phi(e_i)^{T}$$
Introducing the kernel matrix $K = \Phi^{T} \Phi$, perform the eigendecomposition of $C$, as shown in Eq. (18):
$$n \lambda \eta = K \eta$$
where $\lambda$ represents the eigenvalues and $\eta$ represents the eigenvectors.
Setting the cumulative contribution rate to 85%, arrange the eigenvalues in descending order and select the top c eigenvalues $\lambda_j$ ($j = 1, 2, \ldots, c$) along with their corresponding eigenvectors $\eta_j$ ($j = 1, 2, \ldots, c$), as specified in Eq. (19):
$$\frac{\sum_{j=1}^{c} \lambda_j}{\sum_{j=1}^{n} \lambda_j} \ge 85\%$$
When the cumulative contribution rate reaches the specified requirement, calculate the samples $G$ after nonlinear dimensionality reduction mapping, as specified in Eq. (20):
$$G = K \cdot [\eta_1, \eta_2, \ldots, \eta_c]$$
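The component selection rule (keep the smallest number of leading eigenvalues whose cumulative contribution reaches 85%) can be sketched as below. The function name and the eigenvalues in the usage note are purely illustrative and are not taken from the paper's data.

```python
def select_components(eigenvalues, target=0.85):
    """Return (c, rate): the smallest c whose cumulative contribution
    rate reaches `target`, and that rate."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for c, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return c, cumulative / total
    return len(ordered), 1.0
```

For the illustrative eigenvalues `[5, 3, 1, 0.5, 0.5]`, the cumulative rates are 0.5, 0.8 and 0.9, so three components are kept.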

Fault diagnosis flow of transformer under unbalanced small sample condition
In this paper, an effective transformer fault diagnosis method is proposed from three perspectives: class imbalance processing, feature extraction and pattern recognition. The specific flow chart is shown in Fig. 2, which mainly includes two stages: offline model training and online recognition.
The offline model training stage is divided into the following four steps.
Step 1 Standardize the collected DGA sample data, use TLR to remove the boundary data and noise of the training set, and then use ADASYN to expand the minority-class samples.
Step 2 Construct the 18-dimensional feature using the non-coding ratio method, fuse the features with KPCA to remove redundant information, and then divide the data into a training set, a validation set and a test set in proportion.
Step 3 Optimize the n_estimators and max_depth parameters of the RF model with the SO algorithm.
Step 4 Verify the accuracy of the model at each iteration with the validation set. When the accuracy improves by less than 0.001 in two consecutive trainings, complete the model training and save the model parameters; otherwise, re-train the model until the condition is met. Then feed the test set into the trained SO-RF model to check the diagnostic accuracy of the model.
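The stopping rule in Step 4 can be written as a small helper. This sketch assumes one reading of the rule, namely that training stops once the validation accuracy has improved by less than 0.001 in each of the last two iterations; the function and parameter names are hypothetical.

```python
def should_stop(accuracies, tol=0.001, patience=2):
    """True once each of the last `patience` improvements is below `tol`."""
    if len(accuracies) < patience + 1:
        return False
    recent = accuracies[-(patience + 1):]
    return all(later - earlier < tol for earlier, later in zip(recent, recent[1:]))
```

For a validation history of `[0.90, 0.95, 0.9505, 0.9507]`, the last two improvements are 0.0005 and 0.0002, so training would stop after the fourth iteration.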
The online identification stage is divided into the following three steps.
Step 1 Normalize the transformer fault samples collected in real time.
Step 2 Construct the 18-dimensional feature using the non-coding ratio method, and then obtain the fusion feature by projecting onto the selected principal components.
Step 3 Feed the fusion features into the optimal classification model to identify the transformer state.

Model evaluation index
In traditional transformer fault diagnosis, the commonly used diagnostic metric is accuracy, a single measure that does not distinguish between misclassifications and missed detections. To address this limitation, this paper introduces several comprehensive metrics for transformer fault diagnosis, including the recall ratio (R), precision ratio (P), Kappa coefficient, and F1 index. The recall ratio (R) reflects missed detections of a specific fault type (a low recall means many missed detections), while the precision ratio (P) reflects misclassifications into that fault type. In practical scenarios, recall may be high while precision is low, or vice versa. To balance both aspects, the F1 index, the harmonic mean of recall and precision, is introduced. A higher F1 value indicates better model performance. The specific formulas are as follows:
$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times P \times R}{P + R}$$
where TP is the number of samples judged as a given fault type that are judged correctly; FP is the number of samples judged as that fault type that actually belong to another class; FN is the number of samples of that fault type that are wrongly judged as another class.
The Kappa coefficient formula is as follows:
$$Kappa = \frac{P_0 - P_e}{1 - P_e}$$
where $P_0$ is the sum of the number of correctly classified samples of each class divided by the total number of samples; $P_e$ is the sum over all categories of the product of the actual and predicted quantities, divided by the square of the total number of samples. Generally, the results of the Kappa calculation fall within [0, 1] and can be divided into five groups to represent different levels of consistency, namely: very low consistency, general consistency, medium consistency, high consistency and almost complete consistency. When used as an evaluation index of the model, the closer the calculated value is to 1, the better the diagnostic effect of the model.
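These indices can all be computed from a confusion matrix. The sketch below assumes rows are true classes and columns are predicted classes; the function name `evaluate` and the 2x2 matrix in the usage note are illustrative only.

```python
def evaluate(confusion):
    """Per-class precision/recall/F1 plus accuracy and Kappa.

    confusion[i][j] = number of class-i samples predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]                          # actual counts
    col_sums = [sum(confusion[i][j] for i in range(n)) for j in range(n)]  # predicted counts
    diag = [confusion[i][i] for i in range(n)]                          # TP per class

    precision = [diag[i] / col_sums[i] if col_sums[i] else 0.0 for i in range(n)]
    recall = [diag[i] / row_sums[i] if row_sums[i] else 0.0 for i in range(n)]
    f1 = [2 * p * r / (p + r) if p + r else 0.0 for p, r in zip(precision, recall)]

    p0 = sum(diag) / total                                              # observed agreement
    pe = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2    # chance agreement
    kappa = (p0 - pe) / (1 - pe)
    return {"accuracy": p0, "precision": precision, "recall": recall,
            "f1": f1, "kappa": kappa}
```

For the illustrative matrix `[[8, 2], [1, 9]]`, $P_0 = 17/20 = 0.85$ and $P_e = (10 \times 9 + 10 \times 11)/400 = 0.5$, which gives a Kappa of 0.7.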

Example analysis
In this paper, 338 sets of monitoring data provided by a power supply company in Zhejiang, China, were selected as a sample set, including 7 different operating states of medium discharge and overheat, low temperature overheat, high temperature overheat, partial discharge, low energy discharge, high energy discharge and normal, which were respectively represented by labels 1-

Transformer fault data preprocessing and feature selection
When the transformer fails, the composition and concentration of dissolved gas in the insulation oil will change. Therefore, the contents of dissolved H2, CH4, C2H4, C2H6 and C2H2 in the transformer oil are used as the basis for transformer fault diagnosis. The content of each gas is normalized, as shown in formula (25):
$$x_i^{*} = \frac{x_i - x_{i\min}}{x_{i\max} - x_{i\min}}$$
where $x_i$ and $x_i^{*}$ are the features before and after normalization; $x_{i\max}$ and $x_{i\min}$ represent the original maximum and minimum values before normalization. In order to explore the correlation between the ratios of dissolved gas contents in oil and the fault type, the 18-dimensional joint feature is constructed using the non-coding ratio method. Where, THC
Table 1. Category label and sample distribution.
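The min-max normalization of formula (25) can be sketched for a single gas as follows. The function name and the sample values are illustrative; mapping a constant column to zeros is an assumption made here to avoid division by zero.

```python
def min_max_normalize(values):
    """Scale one gas-content column to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, map it to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

For example, `min_max_normalize([10, 20, 30])` gives `[0.0, 0.5, 1.0]`.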

Data balancing processing
As indicated in Table 1, normal samples constituted 45.07% of the total samples, while partial discharge, low-energy discharge, and combined discharge and overheating samples represented 7.40%, 5.92%, and 2.99% of the total samples, respectively. Such data imbalance could lead to the misclassification of minority-class samples as normal, resulting in diminished recognition accuracy. To address this issue, this paper employs the TLR algorithm to filter out noise and boundary data from the training set. Subsequently, the ADASYN algorithm is utilized to augment the number of fault samples. The distribution of sample quantities before and after this processing is presented in Table 3.

Feature selection
To mitigate the inclusion of redundant information in fault features, Kernel Principal Component Analysis (KPCA) was utilized to integrate the constructed 18-dimensional joint features. The contribution rates and cumulative contribution rates of each principal component are visualized in Fig. 3. Within this figure, it is evident that the initial principal component encompasses the majority of feature information, and as the number of principal components increases, the volume of feature information decreases. The cumulative contribution rate associated with each principal component was calculated as per Formula (19) and is presented in Table 4.
As illustrated in Table 4, the cumulative variance contribution rate of the first seven principal components reaches 0.876.This signifies that these initial seven principal components capture over 85% of the explanatory power inherent in all principal components.Consequently, the first seven principal components are chosen as the inputs for the transformer fault diagnosis model.To further underscore the efficacy of KPCA feature fusion, two-dimensional scatter plots are generated for distinct principal components, as visualized in Fig. 4. The scatter plot in Fig. 4 reveals that the clustering effect is most pronounced in the first and second principal components, with the clustering effect diminishing progressively for subsequent principal components.

Fault diagnosis result
Fusion features extracted by KPCA were divided into a training set, a test set and a validation set in the ratio 6:2:2, as shown in Table 5.
To obtain the optimal diagnostic model, the SO algorithm was employed to optimize the n_estimators and max_depth of decision trees within the RF model. A population size of 30 and a maximum iteration count of 100 were set. The search range for the number of decision trees was (0, 100), and the search range for decision tree depth was (0, 20). The simulations in this study were conducted using MATLAB 2018b software, and the resulting confusion matrix is shown in Fig. 5. From Fig. 5, it can be observed that out of the 204 samples in the test set, 198 were correctly diagnosed, resulting in an overall accuracy of 97.06%. Specifically, the accuracy of diagnosing medium and low-temperature overheating, partial discharge, and combined discharge and overheating faults was 100%. Based on the data in the confusion matrix, the diagnostic model's precision (P), recall (R), and F1-score were calculated as 0.9704, 0.9711, and 0.9707, respectively. Additionally, the Kappa coefficient of the diagnostic model was 0.9659, indicating almost perfect agreement, further confirming the high fault recognition accuracy and excellent stability of the model proposed in this study.
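As a quick arithmetic check on the figures quoted above, using only the reported counts and the reported P and R, the overall accuracy and the F1 index can be recomputed. The F1 line assumes that the reported F1 is obtained from the averaged P and R, which the text does not state explicitly.

```latex
\begin{align*}
\text{Accuracy} &= \frac{198}{204} \approx 0.9706 = 97.06\% \\
F1 &= \frac{2 \times P \times R}{P + R}
    = \frac{2 \times 0.9704 \times 0.9711}{0.9704 + 0.9711} \approx 0.9707
\end{align*}
```

Both values agree with the reported accuracy and F1-score.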

Qualitative and quantitative analysis of TLR-ADASYN data equalization
To validate the effectiveness of the TLR-ADASYN sampling method, this study conducts a comprehensive performance comparison of various sampling methods, combining qualitative observations with quantitative analysis. Firstly, to visually demonstrate that the TLR-ADASYN sampling method augments the sample size while preserving essential data characteristics, the study employs t-distributed Stochastic Neighbor Embedding (t-SNE) [34] to map transformer dissolved gas data into a three-dimensional space for visualization, as depicted in Fig. 6. In Fig. 6, the blue dots represent samples after applying the sampling method, while the orange dots represent samples before sampling. Within this three-dimensional coordinate graph, the data distribution patterns of the different fault types remain consistent before and after the TLR-ADASYN sampling method is applied. Furthermore, the statistical characteristics align, which supports the validity and reliability of the augmented data.
Secondly, we conducted a quantitative comparison of the performance of various sampling methods, evaluating five different treatment approaches, namely, non-equilibrium dataset, random oversampling, SMOTE oversampling, ADASYN oversampling, and ROS downsampling. The resulting diagnostic outcomes are presented in Table 6. As illustrated in Table 6, the diagnostic accuracy of the original dataset, without any balancing processing, was 88.24%, with a Kappa coefficient of 0.8654. The adoption of oversampling or downsampling algorithms led to varying degrees of improvement in diagnostic accuracy. However, when the downsampling algorithm was employed, valuable information was lost due to the removal of a portion of the majority class sample data. Compared with ADASYN, SMOTE, and random oversampling, the diagnostic accuracy of the method proposed in this paper increased by 0.59%, 1.96%, and 4.41%, respectively, and the Kappa coefficient increased by 0.0057, 0.0224, and 0.0505, respectively. The experimental results demonstrate that the approach introduced in this paper effectively addresses the issue of insufficient samples in certain classes, mitigating the decline in diagnostic accuracy caused by a model's inclination toward the majority class samples.

Comparative analysis of diagnostic results under different characteristics
The use of KPCA feature extraction also has a significant impact on improving diagnostic accuracy. In this study, oversampled IEC three-ratio features, Rogers' four-ratio features, 18-dimensional joint features, and the first 7 dimensions of features extracted using kernel principal component analysis were analyzed and compared, as shown in Fig. 7. In the figure, the red dots represent samples in the test set that were correctly classified, while the blue circles represent samples with their true classifications. The scattered points indicate samples misclassified as other categories, and a higher number of scattered sample points indicates lower diagnostic accuracy. From Fig. 7, it can be observed that the IEC three-ratio features and Rogers' four-ratio features produce more scattered points than the 18-dimensional joint features, indicating that the 18-dimensional joint features are better at exploring the relationship between fault types and dissolved gases in the oil. Table 7 shows that the corresponding Kappa coefficients for the four different features are 0.9433, 0.9209, 0.8821, and 0.8543. Using KPCA fusion features reduced the feature dimensionality and significantly improved fault diagnosis accuracy, thus confirming the superiority of this method.

Comparative analysis of different fault diagnosis models
To illustrate the effectiveness of this diagnosis method, it was compared with the GA-XGBoost diagnosis model proposed in Ref. 35, the PSO-BiLSTM diagnosis model proposed in Ref. 36 and the WOA-SVM diagnosis model proposed in Ref. 37; the diagnostic results are shown in Table 8, which shows the superiority of the diagnostic model proposed in this paper. The 7-dimensional fused and dimensionally reduced features were separately input into the three models, GA-XGBoost, PSO-BiLSTM, and WOA-SVM, for comparative analysis against the diagnostic model proposed in this study. The diagnostic results are shown in Fig. 8, and the model evaluation metrics are compared in Table 9. From the figure and the table, it can be observed that the SO-RF model had the fewest misclassified samples, resulting in an accuracy improvement of 1.47%, 2.45%, and 3.43% compared to the GA-XGBoost, PSO-BiLSTM, and WOA-SVM diagnostic models, respectively. In comparison with the recognition accuracy in the original literature, the improvement was 1.91%, 1.13%, and 1.54%, respectively. Furthermore, in terms of evaluation metrics such as recall, precision, and F1 score, the method proposed in this study exhibits more stable performance compared to other models. From the perspective of the Kappa coefficient, the method presented in this study achieved a score of 0.9546, indicating almost perfect agreement. This further underscores the effectiveness of the feature extraction method and fault diagnostic model proposed in this study.

The generalization performance analysis of the model
Additional datasets were employed to assess the model's ability to generalize. Specifically, the IEC TC 10 [38] public dataset was selected for this purpose. In accordance with the categorization provided in Ref. 39, transformer fault types were classified into six categories: medium and low-temperature overheating, high-temperature overheating, low energy discharge, high energy discharge, partial discharge, and normal operation, denoted as labels 1 to 6, respectively. Using the diagnostic techniques proposed in this study, the diagnostic outcomes are presented in Table 10.
As depicted in Table 10, the diagnostic accuracy for the IEC TC 10 dataset stands at 93.98%, accompanied by a Kappa coefficient of 0.9276.This underscores the robust generalization capabilities of the approach introduced in this paper when compared to the previously cited model.

Conclusion
Aiming at the problem of misjudgment and missed detection of minority-class samples caused by unbalanced transformer fault samples, a transformer fault diagnosis method under the condition of unbalanced small samples is proposed, and the following conclusions are drawn through simulation on practical data: (1) The TLR-ADASYN method adopted in this paper can effectively solve the problem of low diagnostic accuracy caused by insufficient and unbalanced transformer fault sample data. In addition, the use of KPCA for feature fusion can avoid redundant information and further improve the accuracy of the model. (2) Compared with the GA-XGBoost, PSO-BiLSTM and WOA-SVM diagnostic models, the accuracy of the SO-RF model proposed in this paper reached 96.08%, and the Kappa coefficient reached 0.9546, which were superior to those of the other models. The results show that the SO-RF model has better stability and generalization.
However, dissolved gases in oil serve only as an early diagnostic indicator for transformers, and relying solely on these gases as input features is insufficient to reflect the overall condition of the transformer. Therefore, future work can collect vibration signal data as additional input for the model. Furthermore, the diagnostic model proposed in this paper did not take into account external factors and the influence of the transformer's inherent characteristics on fault diagnosis accuracy. Subsequent research should consider the impact of these factors on the fault diagnosis model.

Figure 4. Scatter plot of different principal elements.

Figure 5. Confusion matrix of fault diagnosis classification.

Figure 6. Data distribution trend of different types of faults before and after balanced processing.

Figure 7. Comparison of diagnostic results of different feature inputs.

Figure 8. Comparison of results of different diagnostic models.
7. Each operating state includes five characteristic gases, H 2 , CH 4 , C 2 H 4 , C 2 H 6 and C 2 H 2 .The number of samples for each category is shown in Table 1.

Table 2. Characteristic coding and characteristic quantity of dissolved gas in oil.

Table 3. Comparison before and after fault sample preprocessing.

Table 4. Cumulative contribution rates of variance for each principal component.

Table 5. Distribution of sample data.

Table 6. Diagnostic results under different sampling methods.

Table 7. Comparison of Kappa coefficients of different characteristics.

Table 8. Comparison of diagnostic results of different models.

Table 9. Comparison of model evaluation indexes.

Table 10. Diagnostic results under the IEC TC 10 data set.